Pandas and Scikit-learn

Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Think of Pandas as a Python version of Excel. Scikit-learn, on the other hand, is an open-source machine learning library for Python.

While Scikit-learn does a lot of the heavy lifting, it's equally important to ensure that the raw data is processed into a form we can 'feed' to Scikit-learn. Hence, the ability to manipulate raw data with Pandas makes it an indispensable part of our toolkit.

Kaggle

Kaggle is the leading platform for data science competitions. Participants compete for cash prizes by submitting the best predictive model to problems posted on the competition website.

https://www.kaggle.com/competitions

Learning machine learning via Kaggle problems allows us to take a highly directed approach because:

  1. The problems are well-defined and the data is provided, allowing us to immediately focus on manipulating the data, and
  2. The leaderboard allows us to keep track of how well we're doing.

In the following set of exercises, we will be reviewing the data from the Kaggle Titanic competition. Our aim is to make predictions on whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class.

Section 1-0 - First Cut

We will start by processing the training data, which we will then use to 'train' (or 'fit') our model. We then apply the trained model to the test data to make predictions. Finally, we output our predictions to a .csv file, make a submission to Kaggle, and see how well they perform.

It is very common to encounter missing values in a data set. In this section, we will take the simplest (or perhaps simplistic) approach of ignoring the whole row if any part of it contains a NaN value. We will build on this approach in later sections.
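
To make this concrete, here is a minimal sketch of how this looks in Pandas (the values below are made up for illustration):

import pandas as pd
import numpy as np

# dropna() removes any row containing at least one NaN
toy = pd.DataFrame({'Age': [22, np.nan, 35], 'Fare': [7.25, 8.05, np.nan]})
toy.dropna()  # keeps only the first row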

Pandas - Extracting data

First, we load the training data from a .csv file. This is similar to the data found on the Kaggle website:

https://www.kaggle.com/c/titanic-gettingStarted/data


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/train.csv')

Pandas - Cleaning data

We then review a selection of the data.


In [2]:
df.head(10)


Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C

We notice that the columns describe features of the Titanic passengers, such as age, sex, and class. Of particular interest is the column Survived, which indicates whether or not the passenger survived. When training our model, what we are essentially doing is assessing how each feature affects whether or not the passenger survived (or whether the feature has any impact at all).
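
For instance, a quick way to get a feel for a single feature's impact (a sketch, using the DataFrame df loaded above) is to compare survival rates across its values:

df.groupby('Sex')['Survived'].mean()  # average of Survived, i.e. the survival rate, per sex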

Exercise:

  • Write the code to review the tail-end section of the data.
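
(One possible solution, for reference: df.tail(10) displays the last 10 rows, mirroring df.head(10) above.)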

We observe that the columns Name, Ticket and Cabin are, for our current purposes, irrelevant. We proceed to remove them from our data set.


In [3]:
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

Next, we review the type of data in the columns, and their respective counts.


In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)

We notice that the columns Age and Embarked contain missing (NaN) values. As previously discussed, we take the approach of simply removing the rows with missing values.


In [5]:
df = df.dropna()

Question

  • If you were to fill in the missing values, what values would you use? Why?
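
For reference, one common approach (a sketch only, and just one of many reasonable choices; df_alt is an illustrative name) is to fill numerical gaps with the median and categorical gaps with the most frequent value, in place of the dropna step above:

df_alt = pd.read_csv('../data/train.csv')
df_alt['Age'] = df_alt['Age'].fillna(df_alt['Age'].median())
df_alt['Embarked'] = df_alt['Embarked'].fillna(df_alt['Embarked'].mode()[0])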

Scikit-learn only takes numerical arrays as inputs. As such, we need to convert the categorical columns Sex and Embarked into numerical ones. We first review the range of values for the column Sex, and create a new column that represents the same data as numbers.


In [6]:
df['Sex'].unique()


Out[6]:
array(['male', 'female'], dtype=object)

In [7]:
df['Gender'] = df['Sex'].map({'female': 0, 'male':1}).astype(int)

Similarly for Embarked, we review the range of values and create a new column called Port that represents, as a numerical value, where each passenger embarked from.


In [8]:
df['Embarked'].unique()


Out[8]:
array(['S', 'C', 'Q'], dtype=object)

In [9]:
df['Port'] = df['Embarked'].map({'C':1, 'S':2, 'Q':3}).astype(int)

Question

  • What problems might we encounter by mapping C, S, and Q in the column Embarked to the values 1, 2, and 3? In other words, what does the ordering imply? Does the same problem exist for the column Sex?
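
For comparison, a common alternative that avoids imposing an order is one-hot encoding, where each port becomes its own 0/1 column. A minimal sketch using Pandas' get_dummies (the column names follow its prefix convention):

port_dummies = pd.get_dummies(df['Embarked'], prefix='Port')
port_dummies.head()  # columns Port_C, Port_Q, Port_S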

Now that we have numerical columns that encapsulate the information provided by the columns Sex and Embarked, we can proceed to drop them from our data set.


In [10]:
df = df.drop(['Sex', 'Embarked'], axis=1)

We review the columns of our final, processed data set.


In [11]:
cols = df.columns.tolist()
print(cols)


['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Gender', 'Port']

For convenience, we move the column Survived to the left-most column. We note that the left-most column is indexed as 0.


In [12]:
cols = [cols[1]] + cols[0:1] + cols[2:]  # move Survived (index 1) to the front
df = df[cols]

In our final review of our training data, we check that (1) the column Survived is the left-most column, (2) there are no NaN values, and (3) all the values are in numerical form.


In [13]:
df.head(10)


Out[13]:
Survived PassengerId Pclass Age SibSp Parch Fare Gender Port
0 0 1 3 22 1 0 7.2500 1 2
1 1 2 1 38 1 0 71.2833 0 1
2 1 3 3 26 0 0 7.9250 0 2
3 1 4 1 35 1 0 53.1000 0 2
4 0 5 3 35 0 0 8.0500 1 2
6 0 7 1 54 0 0 51.8625 1 2
7 0 8 3 2 3 1 21.0750 1 2
8 1 9 3 27 0 2 11.1333 0 2
9 1 10 2 14 1 0 30.0708 0 1
10 1 11 3 4 1 1 16.7000 0 2

In [14]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
Survived       712 non-null int64
PassengerId    712 non-null int64
Pclass         712 non-null int64
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Gender         712 non-null int64
Port           712 non-null int64
dtypes: float64(2), int64(7)

Finally, we convert the processed training data from a Pandas DataFrame into a numerical (NumPy) array.


In [15]:
train_data = df.values

Scikit-learn - Training the model

In this section, we'll simply use the model as a black box. We'll review more sophisticated techniques in later sections.

Here we'll be using the Random Forest model. The intuition is as follows: a decision 'tree' repeatedly splits the data on whichever feature best separates the outcomes, with each split forming a 'branch'. The Random Forest model, broadly speaking, builds a 'forest' of such trees on random subsets of the data and aggregates their results.

http://en.wikipedia.org/wiki/Random_forest
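
To make the 'aggregates their results' idea concrete, here is a toy sketch (made-up data, not the Titanic set, and three hand-rolled trees rather than scikit-learn's internal machinery): each bootstrapped tree votes, and the forest takes the majority.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                     # 100 samples, 3 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy outcome

votes = []
for seed in range(3):
    idx = rng.randint(0, 100, 100)       # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    votes.append(tree.predict(X[:5]))
majority = (np.mean(votes, axis=0) > 0.5).astype(int)  # forest-style vote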


In [16]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)

We use the processed training data to 'train' (or 'fit') our model. The set of features (with the columns Survived and PassengerId omitted) will be our first input, and the column Survived our second.


In [17]:
model = model.fit(train_data[:, 2:], train_data[:, 0])
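
As an optional check on the intuition that each feature 'makes an impact', we can peek at which features the fitted forest leaned on most. This is a sketch: feature_importances_ is a standard attribute of scikit-learn's RandomForestClassifier, but the exact values vary from run to run.

for name, score in zip(cols[2:], model.feature_importances_):
    print(name, round(score, 3))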

Scikit-learn - Making predictions

We first load the test data.


In [18]:
df_test = pd.read_csv('../data/test.csv')

We then review a selection of the data.


In [19]:
df_test.head(10)


Out[19]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
5 897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 NaN S
6 898 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q
7 899 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29.0000 NaN S
8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN C
9 901 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.1500 NaN S

We notice that the test data has columns similar to our training data, but lacks the column Survived. We'll use our trained model to predict values for the column Survived.

We process the test data in the same fashion as the training data.


In [20]:
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

df_test = df_test.dropna()

df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male':1})
df_test['Port'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})

df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

We now apply the trained model to the test data (omitting the column PassengerId) to produce an output of predictions.


In [21]:
output = model.predict(test_data[:,1:])

Pandas - Preparing for submission

We create a Pandas DataFrame by pairing the PassengerId column from the test data with our output of predictions.


In [22]:
result = np.c_[test_data[:, 0].astype(int), output.astype(int)]
df_result = pd.DataFrame(result, columns=['PassengerId', 'Survived'])

We briefly review our predictions.


In [23]:
df_result.head(10)


Out[23]:
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 1
4 896 0
5 897 0
6 898 0
7 899 0
8 900 1
9 901 0

Finally, we output our results to a .csv file.


In [24]:
df_result.to_csv('../results/titanic_1-0.csv', index=False)

However, it appears that we have a problem. The Kaggle submission website expects "the solution file to have 418 predictions."

https://www.kaggle.com/c/titanic-gettingStarted/submissions/attach

We compare this to our result.


In [25]:
df_result.shape


Out[25]:
(331, 2)

Since we eliminated the rows containing NaNs, we end up with fewer predictions than there are rows in the test data. As Kaggle requires all 418 predictions, we are unable to make a submission.
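
To confirm where the shortfall comes from, a quick sketch (raw_test and used are just illustrative names) that counts the test rows containing a NaN in any column we used; 418 - 331 = 87 such rows:

raw_test = pd.read_csv('../data/test.csv')
used = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
print(raw_test[used].isnull().any(axis=1).sum())  # the 87 rows we dropped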

In this section, we took the simplest approach of ignoring missing values, but failed to produce a complete set of predictions. We look to build on this approach in Section 1-1.